Heterogeneous graphs have multiple node and edge types and are semantically richer than homogeneous graphs. To learn such complex semantics, many graph neural network approaches for heterogeneous graphs use metapaths to capture multi-hop interactions between nodes. Typically, features from non-target nodes are not incorporated into the learning procedure. However, there can be nonlinear, high-order interactions involving multiple nodes or edges. In this paper, we present the Simplicial Graph Attention Network (SGAT), a simplicial-complex approach that represents such high-order interactions by placing the features of non-target nodes on simplices. We then use attention mechanisms and upper adjacencies to generate the representations. We empirically demonstrate the efficacy of our approach on node classification tasks with heterogeneous graph datasets, and further show SGAT's ability to extract structural information by employing random node features. Numerical experiments show that SGAT outperforms other current state-of-the-art heterogeneous graph learning methods.
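As a rough illustration of the mechanism described above, here is a minimal sketch of GAT-style attention restricted to upper-adjacent simplices (simplices sharing a common co-face). The layer structure, tensor shapes, and names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): attention over k-simplices that
# are upper-adjacent, i.e., co-faces of a common (k+1)-simplex.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplicialAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # feature transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, x, upper_adj):
        # x: (num_simplices, in_dim) features placed on k-simplices
        # upper_adj: (num_simplices, num_simplices) binary upper-adjacency
        h = self.W(x)
        n = h.size(0)
        # pairwise attention logits, GAT-style
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pair).squeeze(-1))
        scores = scores.masked_fill(upper_adj == 0, float('-inf'))
        alpha = torch.nan_to_num(F.softmax(scores, dim=-1))  # isolated rows -> 0
        return alpha @ h
```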
Anticipating future actions based on video observations is an important task in video understanding, and it is useful for precautionary systems that need response time to react before an event occurs. Since the input in action anticipation consists only of pre-action frames, models do not have enough information about the target action; moreover, similar pre-action frames may lead to different futures. Consequently, any solution built on existing action recognition models can only be suboptimal. Recently, researchers have proposed using a longer video context to remedy the insufficient information in pre-action intervals, as well as self-attention to query past relevant moments to address the anticipation problem. However, the indirect use of video input features as the query can be inefficient, since the query serves only as a proxy for the anticipation goal. To this end, we propose an inductive attention model that transparently uses the prior prediction as the query to derive the anticipation result by induction from past experience. Our method naturally accounts for the uncertainty of multiple futures via a many-to-many association. On large-scale egocentric video datasets, our model not only consistently outperforms the state of the art with the same backbone and is competitive with methods that employ a stronger backbone, but also achieves superior efficiency with fewer model parameters.
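To make the "prior prediction as query" idea concrete, the following is an illustrative sketch of cross-attention in which a coarse action prediction, rather than the raw video feature, forms the query over past moments; all names and shapes are assumptions, not the paper's architecture.

```python
# Illustrative sketch (names and shapes are assumptions): cross-attention
# where a prior action prediction, not the video feature, is the query.
import torch
import torch.nn as nn

class InductiveAttentionHead(nn.Module):
    def __init__(self, num_classes, feat_dim, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(num_classes, dim)  # prior prediction -> query
        self.k_proj = nn.Linear(feat_dim, dim)     # past-moment features -> keys
        self.v_proj = nn.Linear(num_classes, dim)  # past outcomes -> values
        self.out = nn.Linear(dim, num_classes)

    def forward(self, prior_logits, past_feats, past_logits):
        # prior_logits: (B, C) coarse prediction from the observed frames
        # past_feats:   (B, T, D) features of past relevant moments
        # past_logits:  (B, T, C) what actually followed those moments
        q = self.q_proj(prior_logits.softmax(-1)).unsqueeze(1)  # (B, 1, dim)
        k = self.k_proj(past_feats)                             # (B, T, dim)
        v = self.v_proj(past_logits.softmax(-1))                # (B, T, dim)
        attn = (q @ k.transpose(1, 2) / k.size(-1) ** 0.5).softmax(-1)
        return self.out((attn @ v).squeeze(1))                  # refined (B, C)
```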
Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two, modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but can often suffer from occlusion, audio provides immediate feedback on key moments that may be invisible to vision, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
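A minimal sketch of the fusion step follows, with per-modality tokens mixed by a shared self-attention encoder; the encoders, dimensions, and action head are assumptions for illustration, not the paper's system.

```python
# Hedged sketch: fusing vision, audio, and touch tokens with self-attention.
import torch
import torch.nn as nn

class MultisensoryFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # learned tags marking which modality each token came from
        self.modality_emb = nn.Parameter(torch.randn(3, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.policy = nn.Linear(dim, 7)  # e.g., a 7-DoF action; task-specific

    def forward(self, vision_tok, audio_tok, touch_tok):
        # each input: (B, dim) features from per-modality encoders (camera,
        # contact microphone, vision-based tactile sensor)
        tokens = torch.stack([vision_tok, audio_tok, touch_tok], dim=1)
        tokens = tokens + self.modality_emb        # mark each token's modality
        fused = self.encoder(tokens)               # cross-modal self-attention
        return self.policy(fused.mean(dim=1))      # pooled action prediction
```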
There is growing interest in using multi-robot systems for a variety of tasks and scenarios. The main appeal of such systems is their flexibility, robustness, and scalability. System modularity is an often-overlooked yet promising feature that offers the possibility of leveraging agent specialization while also enabling system-level upgrades. However, altering the agents' capabilities can change the exploration-exploitation balance required to maximize system performance. Here, we study the impact of swarm heterogeneity on its exploration-exploitation balance while tracking multiple fast-moving evasive targets under the framework of cooperative multi-robot observation of multiple moving targets. To this end, we use a decentralized search-and-tracking strategy with adjustable levels of exploration and exploitation. By indirectly tuning the balance, we first confirm the presence of an optimal balance between these two key competing actions. Next, by replacing slower-moving agents with faster ones, we show that the system exhibits a performance improvement without any modification of the original strategy. Moreover, owing to the additional amount of exploitation carried out by the faster agents, we demonstrate that the performance of the heterogeneous system can be further improved by reducing the agents' connectivity level in favor of more exploratory actions. Furthermore, in studying the influence of the density of swarm agents, we show that the addition of faster agents can counterbalance a reduction in the number of agents while maintaining the level of tracking performance. Finally, we explore the challenges of using differentiated strategies to take advantage of the heterogeneous nature of the swarm.
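The following is a hedged sketch of a decentralized search-and-track rule with a single tunable exploration-exploitation weight, in the spirit of the strategy described above; the names, dynamics, and weighting scheme are illustrative assumptions, not the paper's exact strategy.

```python
# Hedged sketch: one agent's velocity rule with a tunable balance weight.
import numpy as np

def agent_velocity(pos, targets_seen, neighbor_pos, w_exploit, speed, rng):
    """Return one agent's 2D velocity command.

    pos: (2,) agent position; targets_seen / neighbor_pos: lists of (2,)
    positions within sensing/communication range; w_exploit in [0, 1]:
    1 = pure tracking (exploitation), 0 = pure dispersion (exploration).
    """
    # Exploitation: steer toward the nearest sensed evasive target, if any.
    if len(targets_seen) > 0:
        nearest = min(targets_seen, key=lambda t: np.linalg.norm(t - pos))
        exploit = nearest - pos
    else:
        exploit = np.zeros(2)

    # Exploration: repel from neighbors (spread out) plus a random heading.
    if len(neighbor_pos) > 0:
        repel = sum((pos - n) / (np.linalg.norm(pos - n) ** 2 + 1e-6)
                    for n in neighbor_pos)
    else:
        repel = np.zeros(2)
    explore = repel + rng.normal(size=2)

    v = w_exploit * exploit + (1.0 - w_exploit) * explore
    norm = np.linalg.norm(v)
    return speed * v / norm if norm > 1e-9 else np.zeros(2)
```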
In this report, we describe the technical details of our submission to the EPIC-Kitchens-100 Action Anticipation Challenge. Our models, a higher-order recurrent space-time transformer and a message-passing neural network with edge learning, are both recurrent-based architectures that observe only 2.5 seconds of inference context to form an action anticipation prediction. By averaging the prediction scores of a set of models compiled from our proposed training pipeline, we achieved strong performance on the test set: 19.61% overall mean top-5 recall, recorded as second place on the public leaderboard.
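For concreteness, the score-level ensembling step amounts to the following; the tensor names and shapes are assumptions.

```python
# Minimal sketch: average per-model prediction scores, then rank top-5 actions.
import torch

def ensemble_top5(model_scores):
    # model_scores: list of (num_clips, num_actions) score tensors,
    # one per trained model in the ensemble
    avg = torch.stack(model_scores, dim=0).mean(dim=0)  # average over models
    return avg.topk(5, dim=-1).indices                  # top-5 action ids per clip
```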
Finding diverse solutions in combinatorial problems has recently received considerable attention (Baste et al. 2020; Fomin et al. 2020; Hanaka et al. 2021). In this paper, we study the following type of problem: given an integer $k$, the problem asks for $k$ solutions such that the sum of pairwise Hamming distances between these solutions is maximized. Such solutions are called diverse solutions. We present a polynomial-time algorithm for finding diverse shortest $st$-paths in weighted directed graphs. Moreover, we study the diverse versions of other classical combinatorial problems, such as diverse weighted matroid bases, diverse weighted arborescences, and diverse bipartite matchings, and we show that these problems can also be solved in polynomial time. To evaluate the practical performance of our algorithm for finding diverse shortest $st$-paths, we conduct computational experiments on synthetic and real-world instances. The experiments show that our algorithm successfully computes diverse solutions within reasonable computational time.
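As a small sketch of the objective being maximized, the diversity of $k$ solutions, each viewed as a set of elements (e.g., the edge set of an $st$-path), is the sum of pairwise Hamming distances; the function below assumes solutions are given as Python sets.

```python
# Sketch of the diversity objective: sum of pairwise (weighted) Hamming
# distances between k solutions, each represented as a set of edges.
from itertools import combinations

def diversity(solutions, weight=None):
    """solutions: list of k edge sets; weight: optional dict edge -> weight."""
    w = (lambda e: weight[e]) if weight else (lambda e: 1.0)
    return sum(sum(w(e) for e in s1 ^ s2)   # symmetric difference = Hamming
               for s1, s2 in combinations(solutions, 2))
```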
New-architecture GPUs such as the A100 are now equipped with multi-instance GPU (MIG) technology, which allows a GPU to be partitioned into multiple small, isolated instances. This technology gives users more flexibility to support both deep learning training and inference workloads, but utilizing it efficiently can still be challenging. The vision of this paper is to provide a more comprehensive and practical benchmark study for MIG, in order to eliminate the need for tedious manual benchmarking and tuning efforts. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines the benchmark study for MIG. Using MIGPerf, the authors conduct a series of experiments, including deep learning training and inference characterization on MIG, GPU sharing characterization, and framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to employ MIG effectively, and lay the foundation for further research on the orchestration of hybrid training and inference workloads on MIGs. The code and results are released at https://github.com/MLSysOps/MIGProfiler. This work is still in progress and more results will be published soon.
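To show the kind of per-instance measurement such a benchmark study automates, here is a generic inference-throughput loop; it is not MIGPerf's code, and the model choice and constants are assumptions. Pinning a process to one MIG instance is done by setting CUDA_VISIBLE_DEVICES to that instance's MIG UUID before launch.

```python
# Hedged sketch: inference throughput on whichever (MIG) device is visible.
import time
import torch
import torchvision

def inference_throughput(batch_size=32, iters=100):
    device = torch.device("cuda")
    model = torchvision.models.resnet50().eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch_size * iters / (time.perf_counter() - start)  # images/sec
```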
The receptive field (RF), which determines the region of a time series that is "seen" and used, is critical to improving performance in time series classification (TSC). However, the variation of signal scales across and within time series data makes it challenging to decide on proper RF sizes for TSC. In this paper, we propose a dynamic sparse network (DSN) with sparse connections for TSC, which can learn to cover various RF sizes without cumbersome hyper-parameter tuning. The kernels in each sparse layer are sparse and can be explored under constraint regions by dynamic sparse training, which makes it possible to reduce the resource cost. The experimental results show that the proposed DSN model achieves state-of-the-art performance on both univariate and multivariate TSC datasets with less than 50% of the computational cost of recent baseline methods, opening the path towards more accurate resource-aware methods for time series analysis. Our code is publicly available at: https://github.com/QiaoXiao7282/DSN.
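The following is an illustrative sketch of a dynamically sparse 1D convolution: a binary mask keeps a fixed fraction of kernel weights, and prune-and-regrow updates let the active connections, and hence the effective RF, move during training. The details are assumptions, not the DSN implementation.

```python
# Hedged sketch: dynamic sparse training for a 1D convolution kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSparseConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, density=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size) * 0.1)
        # binary mask: which kernel weights are currently active
        self.register_buffer("mask", (torch.rand_like(self.weight) < density).float())

    def forward(self, x):
        return F.conv1d(x, self.weight * self.mask, padding="same")

    @torch.no_grad()
    def prune_and_regrow(self, frac=0.1):
        active = self.mask.view(-1).bool()
        k = max(1, int(frac * active.sum().item()))
        # prune the k active weights with the smallest magnitude
        mag = (self.weight.abs() * self.mask).reshape(-1)
        mag[~active] = float("inf")                    # never pick inactive ones
        drop = mag.topk(k, largest=False).indices
        self.mask.view(-1)[drop] = 0.0
        # regrow k connections at random inactive positions
        inactive = (self.mask.view(-1) == 0).nonzero().squeeze(-1)
        grow = inactive[torch.randperm(inactive.numel())[:k]]
        self.mask.view(-1)[grow] = 1.0
```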
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem, since the only input is several RGB images from different first-person views (FPVs) of a multi-person scene, without the BEV image or the calibration of the FPVs, while the output is a unified plane with the localization and orientation of both the subjects and the cameras in the BEV. We propose an end-to-end framework for this problem, whose main idea can be divided into the following parts: i) creating a view-transform subject detection module to transform each FPV into a virtual BEV, including the localization and orientation of each pedestrian; ii) deriving a geometric-transformation-based method to estimate camera localization and view direction, i.e., the camera registration in a unified BEV; iii) making use of spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. The experimental results show the remarkable effectiveness of our proposed method.
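As a hedged sketch of step ii), one camera's virtual BEV can be aligned to the unified plane with a 2D rigid transform estimated by the standard Kabsch/Procrustes method on matched pedestrian positions; the matching is assumed given here, and this is not necessarily the paper's exact procedure.

```python
# Sketch: 2D rigid registration (rotation R, translation t) of one camera's
# virtual BEV onto the unified BEV from matched subject positions.
import numpy as np

def register_camera(src_pts, dst_pts):
    """src_pts, dst_pts: (N, 2) matched positions in the camera's virtual BEV
    and in the unified BEV. Returns R (2x2) and t (2,), i.e., the camera's
    orientation and location in the unified BEV frame."""
    src_c, dst_c = src_pts.mean(0), dst_pts.mean(0)
    H = (src_pts - src_c).T @ (dst_pts - dst_c)   # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```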
Although synthetic aperture imaging (SAI) can achieve a see-through effect by blurring out off-focus foreground occlusions while recovering in-focus occluded scenes from multi-view images, its performance often deteriorates under dense occlusions and extreme lighting conditions. To address this problem, this paper presents an event-based SAI (E-SAI) method that relies on the asynchronous events, with extremely low latency and high dynamic range, acquired by an event camera. Specifically, the collected events are first refocused by a Refocus-Net module to align in-focus events while scattering out off-focus ones. Following that, a hybrid network composed of spiking neural networks (SNNs) and convolutional neural networks (CNNs) is proposed to encode the spatio-temporal information from the refocused events and reconstruct a visual image of the occluded targets. Extensive experiments demonstrate that our proposed E-SAI method achieves remarkable performance in dealing with very dense occlusions and extreme lighting conditions, producing high-quality images from pure events. Codes and datasets are available at https://dvs-whu.cn/projects/esai/.
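For intuition about event refocusing, the learned Refocus-Net can be approximated by the classical shift model: events captured while the camera translates are warped so that points at the focal depth stay aligned. The linear-motion assumption and all variable names below are illustrative.

```python
# Simplified sketch of event refocusing under linear camera translation.
import numpy as np

def refocus_events(x, y, t, v, d, f, t_ref=0.0):
    """x, y (pixels), t (s): event coordinates and timestamps as arrays;
    v: camera speed (m/s) along x; d: focal depth (m); f: focal length (px).
    Returns event coordinates warped to the reference time t_ref."""
    shift = f * v * (t - t_ref) / d   # disparity accrued since t_ref
    return x - shift, y               # in-focus events now align in x
```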